Part I - Determinants of Loan Outcome Status in Prosper P2P Loan Data

by Cvetana Decheva

  1. Introduction
  2. Preliminary Wrangling
  3. Dataset Description and Variables of Interest

3.1. Structure of the Prosper Dataset

3.2. Main Feature of Interest: Loan Outcome Status (Repaid vs. Not Repaid)

3.3. Variables that Support the Investigation of the Loan Outcome Status

3.3.1. Borrower Assessment

3.3.2. Loan Characteristics

3.3.3. Borrower Characteristics

3.3.4. Borrower Credit History

3.3.5. Borrower Indebtedness

  4. Cleaning and Wrangling Issues (Description)
  5. Cleaning and Wrangling Solutions
  6. Univariate Exploration

6.1 Preliminary Exploration

6.2 Feature Engineering

6.2.1. Categorical Transformation of Highly Skewed Count Data

6.2.2. Categorical Transformation of Highly Skewed or Multimodal Continuous Variables

6.2.3. Categorical Transformation of Continuous Variables by Constant Increment

6.3 Univariate Exploration Organized by Variable Groups

6.3.1 Dependent Variable: Loan Outcome Status

6.3.2 Borrower Assessment Variables

6.3.3 Loan Characteristics

6.3.4 Borrower Characteristics

6.3.5 Borrower Credit History Variables

6.3.6 Borrower Indebtedness

  7. Bivariate and Multivariate Exploration

7.1. Relationships between predictor variables

7.1.1 Borrower Assessment Variables

7.1.2 Loan Characteristics

7.1.3 Borrower Characteristics

7.1.4 Borrower Credit History Variables, Borrower Indebtedness

7.1.5 Correlations across Predictor Variable Categories

7.2. Relationship of Outcome Variable to Predictor Variables

7.2.1 Borrower Assessment Variables

7.2.2 Loan Characteristics

7.2.3 Borrower Characteristics

7.2.4 Borrower Credit History Variables, Borrower Indebtedness

7.2.5 Chi-Square Test of Independence between Loan Outcome and Ordinal Predictor Variables

7.3. Relationship of Loan Outcome Status to Predictor Variables Grouped by Loan Purpose

7.3.1 Borrower Assessment Variables

7.3.2 Loan Characteristics

7.3.3 Borrower Characteristics

7.3.4 Borrower Credit History Variables, Borrower Indebtedness

  8. Conclusions

1. Introduction

Prosper Marketplace, Inc. is a San Francisco, California-based company in the peer-to-peer lending industry. Prosper Funding LLC, one of its subsidiaries, operates Prosper.com, a website where individuals can either invest in personal loans or request to borrow money [1].

[1](#prosperwiki) Prosper Wikipedia

The data set used in this report contains 113,937 loans with 81 variables on each loan. The last update was made on 03.11.2014.

Data Source

Variable Dictionary

2. Preliminary Wrangling

We copied the original data for cleaning.

3. Dataset Description and Variables of Interest

3.1 Structure of the Prosper Dataset

The original dataset contains 113,937 records and 81 variables. The variables are described in a data dictionary. Links to the data source and the dictionary are provided in Section 1 above.

3.2 Main Feature of Interest: Loan Outcome Status (Repaid vs. Not Repaid)

The loan outcome status variable contains multiple categories. They can be separated into ongoing (or current) loans and past (or completed) loans. There are several categories of past/completed loans, which can be compressed to two: repaid vs. not repaid.

3.3 Variables that Support the Investigation into the Loan Outcome Status

Serrano-Cinca, Gutierrez-Nieto & Lopez-Palacios (2015) [2] investigated the variables determining default in the database of another P2P credit enterprise. They grouped the variables together in the following manner:

3.3.1 Borrower Assessment

Borrower Assessment consists of customer credit rating and loan interest rate. The interest rate acts as another form of assessment of a customer's credibility and as such is expected to show a high positive correlation to the customer credit rating.

Prosper uses two different customer scoring systems (one consisting of 7 grades and another consisting of 10 grades) and, additionally, a customer credit score range provided by a consumer credit rating agency.

3.3.2 Loan Characteristics

The characteristics of a loan are its purpose and amount.

The loan characteristics are represented by the Prosper variables 'ListingCategory' (i.e., loan purpose) and 'LoanOriginalAmount'.

ListingCategory has 20 categories plus a "null" category ("Not available").

There is a fairly clear hypothesis related to the relationship between loan purpose and the probability of default, i.e., consumer loans (e.g., a wedding loan) are generally less risky than a loan for a small business.

The relationship between the loan amount and the probability of default is less clear (some studies show a positive correlation, while others show a negative correlation).

3.3.3 Borrower Characteristics

The borrower characteristics are income, housing and employment.

Prosper records data on monthly income ('StatedMonthlyIncome'), which is also represented as an ordinal variable ('IncomeRange'). Furthermore, a boolean variable ('IncomeVerifiable') signifies whether the stated income is verifiable by documents.

3.3.4 Borrower Credit History

Borrower credit history is characterized by:

1) Time since first loan (i.e., credit history length) and number of total and open loans;

2) Number of open revolving accounts and the monthly payment on those;

3) "Revolving Utilization": "Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit";

4) Current delinquencies, number of past delinquencies and amount of delinquencies on loans;

5) Number of inquiries by creditors (in total and for a recent period);

6) Derogatory public records (Prosper saves data for the last 12 months and for the last 10 years).

The Prosper data set does not contain a "revolving utilization" variable. There is a variable 'BankcardUtilization' ("the percentage of available revolving credit that is utilized at the time the credit profile was pulled.") which could be construed as revolving utilization rate.

The rest of the credit history variables are represented in the Prosper data set.

3.3.5 Borrower Indebtedness

Borrower Indebtedness variables are the following: loan amount to annual income, annual payment to annual income and debt to income ratio.

The Prosper data set only contains one of these three variables: 'DebtToIncomeRatio'.

Additionally, the loan term could be one of the factors determining the default rate, especially in combination with other variables.

[2](#plosOnepaper) Serrano-Cinca, C., Gutiérrez-Nieto, B., & López-Palacios, L. (2015). Determinants of default in P2P lending. PloS one, 10(10), e0139427.

4. Cleaning and Wrangling Issues (Description)

Cleaning/Wrangling Issue #1

A few ordinal variables of interest, including 'CreditGrade', 'ProsperRating (Alpha)' and 'IncomeRange', are of data type 'object'. We should change them to ordinal categorical variables. The levels of 'EmploymentStatus' should also be arranged in a meaningful way (although it is not an ordinal variable).

Cleaning/Wrangling Issue #2

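There are three credit rating variables created by Prosper (apart from the professional rating scores 'CreditScoreRangeLower' & 'CreditScoreRangeUpper'): 'CreditGrade', 'ProsperRating (Alpha)'/'ProsperRating (numeric)' and 'ProsperScore'.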

Cleaning/Wrangling Issue #3

The loan term variable, 'Term', is of the data type 'integer'. However, there are only three possible terms, and we are not interested in the mean or median loan term as an outcome variable, so it should be treated as an ordinal variable. We therefore created an additional ordinal variable, 'term_ordinal', with three ordered string values: '12 months', '36 months', '60 months'.

Cleaning/Wrangling Issue #4

The loan purpose is represented by the variable 'ListingCategory (numeric)'. The labels of the categories can be found in the variable dictionary. We created a new categorical variable 'loan_purpose' with labeled categories.

Cleaning/Wrangling Issue #5

There is no variable 'credit history length', but it can be calculated from two other variables: 'FirstRecordedCreditLine' and 'ListingCreationDate'.

Cleaning/Wrangling Issue #6

Shrink the data set

Cleaning/Wrangling Issue #6.1

The main variable of interest, 'LoanStatus', has multiple values: 'Cancelled', 'Chargedoff', 'Completed', 'Current', 'Defaulted', 'FinalPaymentInProgress', 'PastDue'. However, to answer the research question about the factors that determine default, we only investigate loans whose outcome is known (closed loans). In the current data set, closed loans are coded as 'Cancelled', 'Completed' (meaning repaid), 'Defaulted' and 'Chargedoff'. There is a difference between defaulted and charged-off loans, but for our purposes both can be merged into one category, 'not repaid'. The cancelled loans are a very small group that does not contribute to the investigation and can therefore be dropped.

Cleaning/Wrangling Issue #6.2

We are going to exclude closed loans where the borrowers were rated according to the old 7-level rating ('CreditGrade'):

When we inspect the distribution of the scores before and after July 2009 (Issue # 2), they seem to be slightly different. Unfortunately, we cannot be sure whether the difference is due to different borrower characteristics before and after 2009 or to different scoring criteria. Therefore, for the sake of correctness, we should only investigate scores from July 2009 on, because this is when the current rating system was first implemented.

Cleaning/Wrangling Issue #6.3

The data set contains many variables that we are not going to investigate. We therefore save a new dataframe with fewer variables.

Cleaning/Wrangling Issue #6.4

Additionally, we are going to remove any rows containing nulls in the (independent) variables of interest.

Cleaning/Wrangling Issue #7

After shrinking the dataset to the records we want to analyze, the 'loan purpose' variable contains one empty category ('personal loan') and another one with very few records ('not available'). We are going to exclude both.

5. Cleaning and Wrangling Solutions

Cleaning/Wrangling Issue #1

The levels of 'CreditGrade', 'ProsperRating (Alpha)', 'IncomeRange' and 'EmploymentStatus' are unordered.
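One way to give such columns an explicit order is an ordered categorical dtype. The sketch below uses a small toy dataframe rather than the Prosper file, and the level orders shown are assumptions that should be checked against the variable dictionary.

```python
import pandas as pd

# Toy stand-in for the Prosper data; only the column names come from the dataset.
df = pd.DataFrame({
    "ProsperRating (Alpha)": ["C", "AA", "HR", "D", "A"],
    "IncomeRange": ["$25,000-49,999", "$50,000-74,999", "$1-24,999",
                    "$25,000-49,999", "$100,000+"],
})

# Assumed level orders (worst to best rating, lowest to highest income).
ordered_levels = {
    "ProsperRating (Alpha)": ["HR", "E", "D", "C", "B", "A", "AA"],
    "IncomeRange": ["Not employed", "$0", "$1-24,999", "$25,000-49,999",
                    "$50,000-74,999", "$75,000-99,999", "$100,000+"],
}

for column, levels in ordered_levels.items():
    dtype = pd.CategoricalDtype(categories=levels, ordered=True)
    df[column] = df[column].astype(dtype)

# Ordered categoricals support min/max and comparisons between levels.
print(df["ProsperRating (Alpha)"].min(), df["ProsperRating (Alpha)"].max())
```

Values that are not listed in the category list become NaN after the conversion, so it is worth checking the null counts before and after.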

Cleaning/Wrangling Issue #2

There are several borrower credit rating variables. Three of them were created by Prosper ('CreditGrade', 'ProsperRating (Alpha)'/'ProsperRating (numeric)', and 'ProsperScore'), and two are from a professional consumer credit rating agency ('CreditScoreRangeLower' and 'CreditScoreRangeUpper').

'CreditGrade', 'ProsperRating (Alpha)'/'ProsperRating (numeric)' have 7 levels, while 'ProsperScore' has 10 (according to variable dictionary).

Cleaning/Wrangling Issue # 2.1: 7-level score

'CreditGrade' is an ordinal customer credit rating by Prosper applied until July 2009.

'CreditGrade' has 7 levels plus NC, which probably stands for no credit history.

The customer rating applied after July 2009, 'ProsperRating (Alpha)', has the same categories (except "NC" is now NaN), but the proportions seem to be different:

When we compare the bar charts, we see that the older scoring system has heavier "tails" (relatively more customers were rated with extremely high or low grades) compared to the system used from July 2009 onward. Hence the decision to analyze only new data (since July 2009).

Cleaning/Wrangling Issue # 2.2: 10-level score

'Prosper Score' is a custom risk score built using historical Prosper data (after July 2009).

'ProsperScore' is supposed to be a scale from 1 to 10 (worst to best rating). However, 1456 customers were rated with 11.

Looks like 11 = 10: most of the 1456 customers have an "A" or "AA" rating, similar to customers rated 10:

It would therefore be correct to replace 11 by 10.
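The recoding itself is a one-line replacement. A minimal sketch on toy values (the dataframe name `scores` is hypothetical):

```python
import pandas as pd

# Toy scores, including the out-of-range value 11.
scores = pd.DataFrame({"ProsperScore": [1, 7, 10, 11, 11, 4]})

# Fold the value 11 into the top level 10.
scores["ProsperScore"] = scores["ProsperScore"].replace(11, 10)

print(scores["ProsperScore"].value_counts().sort_index())
```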

Cleaning/Wrangling Issue #2.3: 'CreditScoreRangeLower' & 'CreditScoreRangeUpper'

'CreditScoreRangeLower' and 'CreditScoreRangeUpper' are the lower and upper bounds of the borrower credit rating provided by a professional consumer credit rating agency. The two bounds are perfectly correlated, so the upper bound alone suffices for answering the research question about default.

Credit scores [3] are computed using a formula that considers factors such as payment history, overall debt levels, and the number of credit accounts the individual has open. A score between 740 and 850 suggests the individual has been consistently responsible, while scores between 700 and 750 are considered above average. Individuals with low credit scores, below 600, can take steps to improve them, such as making payments on time, cutting down debt levels, and maintaining a zero balance on unused credit accounts.

[3](#investopedia) Credit Score Ranges: What Do They Mean?

A scatter plot shows the perfect correlation between the lower and the upper bound:
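Alongside the plot, the relationship can be checked numerically. This is a sketch on toy values; the constant 19-point gap between the bounds is an assumption made for the illustration and should be verified on the real columns.

```python
import pandas as pd

# Toy values; the fixed 19-point width is an assumption for this example.
credit = pd.DataFrame({"CreditScoreRangeLower": [600, 640, 660, 700, 760]})
credit["CreditScoreRangeUpper"] = credit["CreditScoreRangeLower"] + 19

# Pearson correlation between the two bounds.
r = credit["CreditScoreRangeLower"].corr(credit["CreditScoreRangeUpper"])

# If the range width is constant, the two columns carry the same information.
width = credit["CreditScoreRangeUpper"] - credit["CreditScoreRangeLower"]
print(round(r, 6), width.nunique())
```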

Cleaning/Wrangling issue # 3

'ListingCategory (numeric)' is the variable that contains the loan purpose in numeric form. We need a variable containing the labels. The labels can be found in the variable dictionary.
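A dictionary lookup is one way to attach the labels. The mapping below is deliberately partial and written from memory of the variable dictionary, so every entry should be treated as an assumption and verified there.

```python
import pandas as pd

# Toy codes standing in for the real column.
loans = pd.DataFrame({"ListingCategory (numeric)": [1, 3, 7, 2, 0, 1]})

# Partial code-to-label mapping (assumed); complete it from the variable dictionary.
purpose_labels = {
    0: "Not available",
    1: "Debt consolidation",
    2: "Home improvement",
    3: "Business",
    7: "Other",
}

loans["loan_purpose"] = (
    loans["ListingCategory (numeric)"].map(purpose_labels).astype("category")
)
print(loans["loan_purpose"].value_counts())
```

Codes missing from the mapping turn into NaN, which makes an incomplete dictionary easy to spot.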

Cleaning/Wrangling Issue # 4

There are three possible loan terms: 12, 36 and 60 months.

'Term' is of the type 'integer', but it really is only of interest as a categorical variable, and since the terms vary in an ordered way, it should be ordinal. Therefore, we create a new variable 'term_ordinal'.
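A minimal sketch of this conversion on toy data, building the string labels from the integer term and declaring their order explicitly:

```python
import pandas as pd

# Toy terms in months.
loans = pd.DataFrame({"Term": [36, 60, 12, 36]})

term_levels = ["12 months", "36 months", "60 months"]
term_dtype = pd.CategoricalDtype(categories=term_levels, ordered=True)

# Build the label from the integer and cast it to the ordered dtype.
loans["term_ordinal"] = (loans["Term"].astype(str) + " months").astype(term_dtype)

print(loans["term_ordinal"].cat.categories.tolist())
```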

Cleaning/Wrangling Issue #5

Credit history length can be calculated from 'FirstRecordedCreditLine' and 'ListingCreationDate'.

First, convert both variables of the type 'object' to 'datetime' type:

Now, we compute the difference between the dates in months.
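The two steps can be sketched as follows on toy dates. The month conversion uses an average month length of 30.4375 days, which is an assumption of this example and not necessarily the convention used for the figures in this report.

```python
import pandas as pd

# Toy dates in the same string form as the raw columns.
loans = pd.DataFrame({
    "FirstRecordedCreditLine": ["2001-10-11 00:00:00", "1996-03-18 00:00:00"],
    "ListingCreationDate": ["2014-02-27 08:28:07", "2010-08-26 13:32:38"],
})

# Step 1: parse both columns as datetimes.
for column in ["FirstRecordedCreditLine", "ListingCreationDate"]:
    loans[column] = pd.to_datetime(loans[column])

# Step 2: elapsed whole days, converted to months.
# Assumption: one month = 365.25 / 12 = 30.4375 days.
elapsed = loans["ListingCreationDate"] - loans["FirstRecordedCreditLine"]
loans["credit_history_months"] = elapsed.dt.days / 30.4375

print(loans["credit_history_months"].round(1))
```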

Cleaning/Wrangling Issue #6

Main variable of interest: Loan outcome status.

The loan repayment outcome is represented by the variable 'LoanStatus'.

'Completed', 'Chargedoff', 'Defaulted', and 'Cancelled' are the categories of past loans. For simplification, cancelled loans can be excluded, and the other three categories can be compressed to two (with charged-off and defaulted loans as a single category): repaid (1) vs. not_repaid (defaulted/charged-off, 0). 'FinalPaymentInProgress' could be added to the first category (repaid, 1).

Cleaning/Wrangling Issue 6.1: Remove current loans and 'Cancelled' category

A category containing only 5 data points, such as the cancelled loans, is not informative and can therefore be removed. For this purpose, we save all remaining categories of past loans (including final payment in progress) in a new variable 'loan_status'.

Next, in order to separate the completed loans (repaid or not repaid) from the current loans, we create a new binary variable 'completed01'. In it, past loans (repaid, final payment in progress, defaulted, chargedoff) are signified by 1, and the rest by 0:

Next, we use the variable 'completed01' to save a new df, 'completed_df', consisting only of completed loans (repaid, final payment in progress, defaulted, chargedoff):

In the new 'completed_df', we create a binary variable 'status_bin' for repaid loans ('completed', 'final payment in progress' = 1) vs. unrepaid loans ('defaulted', 'charged-off' = 0):
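The three steps above can be sketched with `isin()` on a toy status column. The variable names follow the text; the toy rows and the helper lists `repaid` and `not_repaid` are only illustrative.

```python
import pandas as pd

# Toy statuses covering closed, current and cancelled loans.
loans = pd.DataFrame({
    "LoanStatus": ["Completed", "Current", "Chargedoff", "Defaulted",
                   "FinalPaymentInProgress", "Cancelled"],
})

repaid = ["Completed", "FinalPaymentInProgress"]
not_repaid = ["Defaulted", "Chargedoff"]

# 1 = loan with a known outcome, 0 = everything else (current, cancelled, ...).
loans["completed01"] = loans["LoanStatus"].isin(repaid + not_repaid).astype(int)

# Keep only loans with a known outcome.
completed_df = loans[loans["completed01"] == 1].copy()

# 1 = repaid, 0 = not repaid.
completed_df["status_bin"] = completed_df["LoanStatus"].isin(repaid).astype(int)

print(completed_df[["LoanStatus", "status_bin"]])
```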

Cleaning/Wrangling Issue #6.2: Exclude Records Made before July 2009 from 'completed_df'

The old Prosper credit rating variable 'CreditGrade' was used until July 2009. Previously, we compared the distribution of the old and the new Prosper credit rating ('ProsperRating (Alpha)/ (numeric)') and found that they were different, possibly due to different criteria applied to borrowers. Therefore, we are now going to exclude all older records from the 'completed_df' and save the new records in a new df, 'completed_09'.
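One possible filter, shown on toy rows, keeps only records that carry no old-style grade. Filtering on 'CreditGrade' being null is an assumption of this sketch; a cutoff on the listing date would be an alternative.

```python
import pandas as pd

# Toy rows: old records have a CreditGrade, new ones a ProsperRating (Alpha).
completed_df = pd.DataFrame({
    "CreditGrade": ["C", None, None, "HR", None],
    "ProsperRating (Alpha)": [None, "B", "D", None, "AA"],
})

# Keep rows without an old-style grade (assumed criterion for "after July 2009").
completed_09 = completed_df[completed_df["CreditGrade"].isna()].copy()

print(completed_09["CreditGrade"].notna().sum())
```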

There are no records with the old rating left:

Cleaning/Wrangling Issue #6.3: Exclude Variables from 'completed_09'

In order to shrink the dataframe, we exclude (some of) the variables that we are not going to investigate.

Now there are 59 variables left (including the new ones that we created by transformation):

Cleaning/Wrangling Issue #6.4: Deal with Nulls in 'completed_09'

The 205 loans without closed date are of the category "final payment in progress":

The 2988 records with no 'DebtToIncomeRatio' are possibly due to division by zero (i.e., zero income in the denominator).

404 records with null in 'DebtToIncomeRatio' have a zero in 'StatedMonthlyIncome'.

However, the Prosper data dictionary states that 'This ['DebtToIncomeRatio'] value is Null if the debt to income ratio is not available.' Therefore, the rows with nulls in this variable can be excluded. The same goes for 'EmploymentStatusDuration'. The df 'completed_09' contains enough records (26,210).

The new dataframe without these missing values is 'completed_09_01'.
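A sketch of this step on toy rows, dropping only the records that are null in one of the two named columns:

```python
import numpy as np
import pandas as pd

# Toy rows with nulls in the two variables of interest.
completed_09 = pd.DataFrame({
    "DebtToIncomeRatio": [0.17, np.nan, 0.35, 0.22],
    "EmploymentStatusDuration": [24.0, 60.0, np.nan, 3.0],
    "StatedMonthlyIncome": [4000.0, 0.0, 5200.0, 3100.0],
})

# Drop a row only if one of these two columns is null.
required = ["DebtToIncomeRatio", "EmploymentStatusDuration"]
completed_09_01 = completed_09.dropna(subset=required).copy()

print(completed_09_01[required].isna().sum())
```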

The null records from both variables are now cleared:

Check what is left of the nulls in the rest of the variables:

Cleaning/Wrangling Issue #6.5: Create dummy outcome variable with labels

Create a dummy variable (repaid vs. not repaid) by copying the variable 'status_bin' with labels 'Repaid' vs. 'Not repaid'.

Cleaning/Wrangling issue #7

The category "Personal loan" contains 0 records and the category "Not available" contains 7 records. Both can be removed:
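For a categorical column, dropping the rows is not enough, because the empty levels stay registered in the dtype. A minimal sketch on toy data (the label spellings are assumptions):

```python
import pandas as pd

# Toy column with one empty level ("Personal loan") and one sparse level.
completed_09_01 = pd.DataFrame({
    "loan_purpose": pd.Categorical(
        ["Debt consolidation", "Not available", "Business", "Other"],
        categories=["Not available", "Debt consolidation", "Business",
                    "Personal loan", "Other"],
    )
})

excluded = ["Personal loan", "Not available"]

# Remove the rows, then remove the now-empty levels from the dtype.
keep = ~completed_09_01["loan_purpose"].isin(excluded)
completed_09_01 = completed_09_01[keep].copy()
completed_09_01["loan_purpose"] = (
    completed_09_01["loan_purpose"].cat.remove_unused_categories()
)

print(completed_09_01["loan_purpose"].cat.categories.tolist())
```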

Check after cleaning:

Number of records in 'completed_09_01'

6. Univariate Exploration

6.1 Preliminary exploration

The majority of the variables in the Prosper data set that correspond to those in the theoretical outline of Serrano-Cinca, Gutierrez-Nieto & Lopez-Palacios (2015) are continuous, and some are practically discrete. Here, we present the shapes of their distributions.

Practically all distributions are positively skewed, except the one of the interest rate ('BorrowerRate') which is multimodal ('LoanOriginalAmount' is also multimodal, albeit with one defined peak).

The distribution of two variables resembles visually a log-normal distribution ('StatedMonthlyIncome', 'OpenRevolvingMonthlyPayment').

There is a large group of variables that contain count data [2] ('CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'CurrentDelinquencies', 'DelinquenciesLast7Years', 'InquiriesLast6Months', 'TotalInquiries', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months'), and one continuous variable has so few values that it is practically discrete ('CreditScoreRangeUpper').

[2](#countdata) How to Deal with Count Data

One of the variables, 'DebtToIncomeRatio' is censored at the positive tail (at 1001%). From variable dictionary: "This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%)."

6.2 Feature engineering

The purpose of this report is to illustrate the relationship of the variable of interest to all possible predictors, therefore a relatively easy solution would be to convert all variables into ordinal:

6.2.1 Categorical Transformations of Highly Skewed Count Data

An easy solution would be to use pandas.qcut(). For these variables, however, zero is an important value and often the mode, so we would like to keep it as a separate category and split the remaining values by percentiles. The method of choice is therefore numpy.select().

We chose to transform the scores by putting the zeros into a separate category and the rest of the values as follows:

After transforming all count variables into categorical, we are going to transform them into ordinal variables.
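The approach can be sketched as follows on toy counts. The cut points (quartiles of the non-zero values) and the level labels are assumptions chosen for the illustration, not the ones used for the variables below.

```python
import numpy as np
import pandas as pd

# Toy count data with a large share of zeros.
counts = pd.Series([0, 0, 0, 1, 1, 2, 3, 5, 8, 13], name="CurrentDelinquencies")

# Quartile cut points computed on the non-zero values only (assumed choice).
nonzero = counts[counts > 0]
q1, q2, q3 = nonzero.quantile([0.25, 0.50, 0.75])

# np.select uses the first condition that matches, so the order matters.
conditions = [counts == 0, counts <= q1, counts <= q2, counts <= q3]
labels = ["0", "low", "medium", "high", "very high"]  # hypothetical labels
binned = np.select(conditions, labels[:-1], default=labels[-1])

# Convert the resulting strings into an ordered categorical.
dtype = pd.CategoricalDtype(categories=labels, ordered=True)
delinquencies = pd.Series(binned, index=counts.index).astype(dtype)

print(delinquencies.value_counts().sort_index())
```

Keeping zero as the first condition guarantees that it forms its own level regardless of where the percentiles fall.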

1) 'CurrentCreditLines'

2)'OpenCreditLines'

3)'TotalCreditLinespast7years'

4)'OpenRevolvingAccounts'

5)'CurrentDelinquencies'

6)'DelinquenciesLast7Years'

7) 'InquiriesLast6Months'

8) 'TotalInquiries'

Inquiries from creditors

9) 'PublicRecordsLast10Years'

Derogatory public records

10) 'PublicRecordsLast12Months'

Change data type to ordinal

6.2.2 Categorical Transformations of Highly Skewed or Multimodal Continuous Variables

1) Professional credit rating ('CreditScoreRangeUpper')

As the histogram shows, some intervals of the continuous variable 'CreditScoreRangeUpper' are empty. The interval, which we set at 14, is therefore doubled at certain points of the distribution.

2) 'LoanOriginalAmount'

The levels of the ordinal variable 'loan_amount_int' increment by \$1,000 up to \$26,000. All values from \$26,000 to \$35,000 are lumped together due to small group size.

3) 'OpenRevolvingMonthlyPayment'

The distribution seems similar to log-normal. Evenly-spaced intervals would result in incomparable group sizes. Instead, we can do something similar to a log-transformed scale: 0 as a separate category, 1-49, 50-99, 100-299, 300-499, 500-999, 1000-2999, 3000 or more.
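These uneven intervals map directly onto `pandas.cut()` with explicit edges. The sketch below uses toy payments; the output name `payment_ordinal` is hypothetical, and with `right=False` any amount below 1 falls into the "0" level, which is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Toy monthly payments, one per intended level.
payments = pd.Series([0, 25, 75, 237, 450, 800, 1500, 5155],
                     name="OpenRevolvingMonthlyPayment")

edges = [0, 1, 50, 100, 300, 500, 1000, 3000, np.inf]
labels = ["0", "1-49", "50-99", "100-299", "300-499",
          "500-999", "1000-2999", "3000 or more"]

# right=False makes each bin closed on the left: [0, 1), [1, 50), ...
payment_ordinal = pd.cut(payments, bins=edges, labels=labels, right=False)

print(payment_ordinal.value_counts().sort_index())
```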

4) 'AmountDelinquent'

'AmountDelinquent' is extremely positively skewed and 86% of the values are zero.

5) 'BankcardUtilization'

6) 'DebtToIncomeRatio'

Variable dictionary: "This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%)."

Create ordinal variable 'dti_ordinal'. Define levels by increment of 10% until 50%, then 50-100%, '100-200%', '200% or more': 9 categories.

7) 'EmploymentStatusDuration'

The levels of the interval variable 'emp_dur' increment by 24 months until 528 months; the highest values are lumped together.

8) 'CreditHistoryMonths'

We create a variable whose levels increment by 24 months up to 584 months and lump the highest values together.

6.2.3 Categorical Transformation of Continuous Variables by Constant Increment

Interest rate ('BorrowerRate')

The interest rate in completed credits after July 2009 (df: 'completed_09_01') is between 4% and 36%.

It can be transformed into an ordinal variable, whose levels increment by 2%, resulting in 16 levels.
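A sketch of the constant-increment binning on toy rates. Working in percent keeps the bin edges as exact integers; the output name `rate_ordinal` and the label format are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy interest rates stored as fractions, as in 'BorrowerRate'.
rates = pd.Series([0.04, 0.0766, 0.15, 0.32, 0.3177, 0.36], name="BorrowerRate")

# Work in percent so the edges are exact integers: 4, 6, ..., 36.
rate_pct = (rates * 100).round(2)
edges = np.arange(4, 38, 2)
labels = [f"{lo}-{hi}%" for lo, hi in zip(edges[:-1], edges[1:])]

# Bins are right-closed; include_lowest keeps the minimum of 4% in the first bin.
rate_ordinal = pd.cut(rate_pct, bins=edges, labels=labels, include_lowest=True)

print(len(labels))
print(rate_ordinal.tolist())
```

With edges from 4% to 36% in steps of 2 percentage points, the 17 edges produce the 16 levels mentioned above.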

6.3. Univariate Exploration Organized by Variable Groups

according to Serrano-Cinca, Gutierrez-Nieto & Lopez-Palacios (2015)

6.3.1 Dependent Variable: Loan Outcome Status

Variables: 'LoanStatus' → 'loan_status' with 4 categories (completed, finalPaymentInProgress, charged-off, defaulted); 'status_bin' → 'repaid_yn' (2 categories: 'Repaid', 'Not repaid')

Question: What was the number of repaid vs. not repaid loans within the period of interest?

1) How many records are there?

2) Check first and last record in data frame.

3) The counts of the 4 categories ('loan_status')

4) The counts of the 2 categories ('repaid_yn')

Answer (loan outcome status):

Between 13.07.2009 and 28.02.2014 there were 23,222 completed loans (with data on each of the variables of interest, cancelled loans excluded). Out of those, 17,892 were repaid (including 189 for which the final payment was in progress) and 5,330 were not repaid (885 defaulted and 4,445 charged-off). Repaid loans outnumbered unrepaid loans by a factor of about 3.36 (17,892 / 5,330).
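The ratio and the shares can be recomputed from the two counts reported in this answer. This is plain arithmetic on those figures, not a rerun of the analysis.

```python
# Counts taken from the answer above.
repaid = 17_892
not_repaid = 5_330

total = repaid + not_repaid
ratio = repaid / not_repaid      # repaid loans per unrepaid loan
repaid_share = repaid / total    # fraction of completed loans that were repaid

print(f"total = {total:,}, ratio = {ratio:.2f}, repaid share = {repaid_share:.1%}")
```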

6.3.2 Borrower Assessment Variables

1) Interest Rate

Variable: 'BorrowerRate'

Question: What is the shape of the distribution, what is the mode, what are the minimum and maximum values of the interest rate in the sample of interest?

1) Minimum and maximum

2) Mode

3) Shape of distribution

Answer (Interest Rate):

The minimum interest rate is 4% and the maximum is 36%. The distribution is multimodal, and its highest peak is at 32%. It remains to be investigated which other variables shape the distribution of the interest rate, in particular for what purpose and for what term loans were given at the most common rate of 32%.

2) Customer Credit Rating

Variables: 'ProsperRating (Alpha)', 'ProsperScore', 'CreditScoreRangeUpper'

Question: How do the three kinds of borrower credit rating compare in terms of shape of distribution and mode(s)?

1) 7-level rating

2) 10-level rating

3) Professional customer credit rating (upper bound)

Answer (Customer Credit Rating):

The 7-level rating ('ProsperRating (Alpha)') is clearly unimodal. The most frequent rating is D (the 5th best value, with HR being the worst and AA the best).

The 10-level rating ('ProsperScore') shows a more differentiated picture, with the peak of the distribution positioned to the right side at 8 (the rightmost score of 10 being the most positive rating).

The professional credit score ('CreditScoreRangeUpper'), in contrast to 'ProsperScore', has a positively skewed distribution whose peak is between the scores of 650 and 700 (mode = 679; median = 719). A comparison to the distribution of the same variable in the whole data set shows that completed loans do not include the lowest values of this variable (below 619). Generally, scores below 600 are considered bad.

6.3.3 Loan Characteristics

Questions: How are loan purpose categories ordered in terms of popularity? What is the shape of the distribution, the range and the mode of loan amounts?

1) Loan Purpose

Variable: 'loan_purpose'

Percentage graph of loan purpose:

2) Loan Amount

Variable: 'LoanOriginalAmount'

Answer (loan characteristics: purpose and amount):

The most popular purpose category in completed loans by far is debt consolidation (n = 11,839), followed by the "umbrella" category "other" (n = 4,243) and home improvement (n = 2,433). Business loans are next with n = 1,606.

The size of completed loans is between \$1,000 and \$35,000. The median value is \$5,000 and the mode is \$4,000. The distribution is positively skewed and multimodal.

The relationship between these two variables would be interesting to investigate (i.e., whether business loans are generally larger).

6.3.4 Borrower Characteristics

Questions: What are the counts of the categories of employment status, and what are the shape and measures of distribution of the employment status duration? What are the shape and measures of distribution of the stated monthly income, what are the proportions of the different annual income categories, and what is the proportion of verifiable income?

1) Employment

Variables: 'EmploymentStatus'; 'EmploymentStatusDuration'

2) Income

Variables: 'StatedMonthlyIncome'; 'IncomeRange'; 'IncomeVerifiable'

3) Housing

Variable: 'IsBorrowerHomeowner'

Answer (borrower characteristics: employment, income and housing):

The most populous category of employment status is 'employed' (n = 15894), which unfortunately is not well defined against the other categories (i.e. 'full-time' and 'part-time'). The second category by count is 'full-time' (n = 6253). The 'self-employed' and 'not employed' categories are practically empty (n=1) and the 'retired', 'part-time' and 'other' groups are much smaller than the first two by count (n = 236, n = 174, n = 659).

The employment duration in months is positively skewed, with a minimum of zero and a maximum of 755 months; the median is at 65 months (5 years 5 months) and the mode is 3 months.

The range of stated monthly income in the sample of interest is between \$1.42 and \$483,333.33. The distribution is strongly positively skewed with a median of \$4,750 and a mode of \$5,000.

The largest category of annual income range by count is \$25,000-49,999. The next largest is \$50,000-74,999. All stated incomes are verified bar one.

As for housing, there are roughly as many homeowners as non-homeowners in the sample (12,402 vs. 10,816).

6.3.5 Borrower Credit History Variables

1) Credit History

Variables: FirstRecordedCreditLine → 'credit_history_months'; CurrentCreditLines → 'current_credits'; OpenCreditLines → 'open_credits'; 'TotalCreditLinespast7years'

Question: What are the distributions of the continuous variables that characterize the borrower credit history (credit history length, number of current credits, number of open credits, number of credits for the past 7 years); what are the largest categories of the categorical variables (current credits, open credits)?

Answer (Credit History):

The credit history of the sample of interest ranges from 7.5 months to 686.6 months (ca. 57 years). The median is 185.4 months (ca. 15.5 years) and the mode is 77.5 months (ca. 6.5 years). The distribution is positively skewed.

The borrowers had between 0 and 59 current credit lines, median of 9 and mode of 8. The distribution is positively skewed. When the variable is translated into separate categories, the most popular range is between 9 and 12 current credit lines, followed by between 6 and 8 credit lines.

The open credit lines ranged between 0 and 48 (median = 8, mode = 6); the distribution was positively skewed. The most popular range of open credit lines was between 5 and 7, followed by between 8 and 10.

The total credit lines in the past 7 years ranged between 2 and 124, median = 25, mode = 20. The distribution is positively skewed.

2) Open Revolving Accounts

Variables: 'OpenRevolvingAccounts' → 'open_accounts'; 'OpenRevolvingMonthlyPayment'

Question: What are the shapes and measures of distribution of the continuous variables (number of open revolving accounts and amount of open revolving monthly payment)? What are the largest categories of the categorical variable number of open revolving accounts by count?

Answer (open revolving accounts and monthly payment):

The borrowers in the sample of interest had between 0 and 47 open revolving accounts. The median is 6, the mode is 4. The distribution is positively skewed. The same variable presented as categorical showed that the most popular range of open revolving accounts was between 6 and 8, followed by between 4 and 5 and between 1 and 3.

The open revolving monthly payment ranged between \$0 and \$5,155. The median was \$237. The mode was \$0. The distribution is positively skewed.

3) Revolving Utilization

Variable: 'BankcardUtilization' ("The percentage of available revolving credit that is utilized at the time the credit profile was pulled.")

Question: What is the shape of the distribution of revolving credit utilization percentage? What are the measures of the distribution?

Answer (Revolving Utilization):

The distribution is bimodal (the modes being 0% and 100%). The median was 55%. The maximum utilization was 250%. It would be interesting to investigate the borrowers whose utilization rate was 100% or more (and whether that predicts default rate).

4) Delinquencies

Variables: CurrentDelinquencies → 'delinquencies'; AmountDelinquent → 'amount_delinquent'; DelinquenciesLast7Years → 'delinquencies_7y'

Question: What are the shapes and measures of distribution of the continuous variables number of current delinquencies on loans, sum of delinquencies for the last 7 years, and amount of payments on delinquencies; what are the most populated categories of the categorical variables?

Answer (delinquencies):

The current delinquencies of borrowers in the sample of interest ranged between 0 and 32, median = 0, mode = 0. The distribution is highly positively skewed. The same variable presented as categorical showed that the largest category by count was the one of no delinquencies.

The delinquencies in the past 7 years ranged between 0 and 99, median = 0, mode = 0, strong positive skewness. Translated into categories: most borrowers had zero delinquencies, the next most populated category was 1-11 delinquencies.

The amount delinquent ranged between \$0 and \$327,677, median = 0, mode = 0, strong positive skewness. When the variable is transformed into ranges, the largest category by count is \$0.

5) Inquiries by Creditors

back to table of contents

Variables: 'InquiriesLast6Months' → 'inquiries_6m'; 'TotalInquiries' → 'inquiries_total'

Answer (inquiries by creditors):

The inquiries by creditors about the borrowers, summed up for the last 6 months, ranged between 0 and 22, median = 1, mode = 0, strong positive skewness.

Translated into ranges, the mode was again "None", followed by 1 inquiry.

As far as total inquiries are concerned, the range was between 0 and 74, median = 4, mode = 2, strong positive skewness.

The total inquiries presented as a categorical variable showed that the most common range was 2-3 inquiries, followed by 4-5 inquiries. The third largest group was between 6 and 9 inquiries.

6) Derogatory Public Records

back to table of contents

Variables: 'PublicRecordsLast10Years' → 'pub_rec_10y'; 'PublicRecordsLast12Months' → 'pub_rec_12m'

Answer (derogatory public records):

The derogatory public records summed up for the last 10 years ranged between 0 and 12, median = 0, mode = 0, strong positive skewness. Translated into ranges: most borrowers had 0 records, followed by 1 record and finally, by 2 or more.

Summed up for the last 12 months, the derogatory public records ranged between 0 and 4, median = 0, mode = 0, strong positive skewness. Translated into ranges: the bulk of the borrowers had no derogatory records, 231 had one or more.

6.3.6 Borrower Indebtedness

back to table of contents

Variable: 'DebtToIncomeRatio'

Description: "The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%)."

Answer (borrower indebtedness):

The debt-to-income ratio in the sample of interest ranged between 0% and 1001% (however, data are censored at the maximum, as the variable dictionary states). The median is 20%, the mode is 18%, the distribution is strongly positively skewed.

7. Bivariate and Multivariate Exploration

back to table of contents

7.1. Relationship between Predictor Variables

7.1.1 Borrower Assessment Variables

Spearman's correlation of all 4 variables (with 'ProsperRating (numeric)' as substitute for 'ProsperRating (Alpha)')

The correlations between the interest rate ('Borrower Rate') and the three credit rating variables are inverse (low-rated borrowers are offered loans at high interest rates). The highest correlation is between interest rate and Prosper Rating (the 7-level rating, Spearman's rho = -.947). The correlations of the interest rate with the 10-level rating (Prosper Score) and with the professional credit rating (Credit Score Range Upper) are lower (Spearman's rho = -.738 and -.575, respectively).
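A Spearman correlation matrix of this kind can be computed directly in pandas. The sketch below uses invented toy values; only the column names are taken from the Prosper dataset, and the resulting coefficients have nothing to do with the figures reported above.

```python
import pandas as pd

# Toy values (not Prosper data); column names follow the Prosper dataset.
toy = pd.DataFrame({
    "BorrowerRate": [0.31, 0.27, 0.22, 0.18, 0.12, 0.07],
    "ProsperRating (numeric)": [1, 2, 3, 4, 6, 7],
    "ProsperScore": [2, 3, 3, 6, 8, 10],
    "CreditScoreRangeUpper": [659, 639, 699, 719, 759, 799],
})

# Spearman's rho is the Pearson correlation of the ranked scores,
# so it only assumes a monotonic (not a linear) relationship.
rho = toy.corr(method="spearman")
print(rho["BorrowerRate"].round(3))
```

Because the rating variables are ordinal, a rank-based coefficient is a natural choice here; the numeric rating is used because the alphabetic rating cannot be ranked without an explicit ordering.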

7.1.2 Loan Characteristics

back to table of contents

1) Loan purpose vs loan amount

Loan purpose vs loan amount: all categories of loan purpose are characterized by high positive outliers (\$20,000 - 40,000) of loan amount, except two: 'vacation' and 'cosmetic procedure'. The highest outliers are in the categories 'business', 'motorcycle', and 'debt consolidation'.

2) Loan term vs loan amount

back to table of contents

Loan term vs loan amount: The medians of loans of 12-month and 36-month term are approximately equal. The median amount of 60-month loans is higher. 36-month and 60-month loans contain higher outliers (> \$35,000) compared to 12-month loans (< \$28,000).

3) Loan term vs loan purpose

back to table of contents

7.1.3 Borrower Characteristics

back to table of contents

1) Homeownership vs. income range

Homeownership vs. income range: in the high income ranges (\$50,000 - >\$100,000), home owners prevail. In the lower income ranges (<\$50,000) the situation is the opposite.

2) Employment status duration vs. income range

back to table of contents

Employment status duration vs. income range: there is a slight difference in the medians of employment status duration across the income levels. All groups are characterized by high positive outliers.

3) Employment status duration vs. homeownership

back to table of contents

Employment status duration vs. homeownership: homeownership tends to prevail among borrowers who have been in their current employment status for a longer time.

7.1.4 Borrower Credit History Variables, Borrower Indebtedness

back to table of contents

1) Time since first loan (i.e., credit history length) and number of total and open loans;

2) Number of open revolving accounts and the monthly payment on those;

3) "Revolving Utilization": "Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit";

4) Current delinquencies, number of past delinquencies and amount of delinquencies on loans;

5) Number of inquiries by creditors (totally and for a current period);

6) Derogatory public records (Prosper saves data for the last 12 months and for the last 10 years).

The Prosper data set does not contain a "revolving utilization" variable. The variable 'BankcardUtilization' is used in its place.

Borrower credit history variables and borrower indebtedness. Spearman's rho (depicted on the heatmap above) is a measure of correlation of ranked scores:

7.1.5 Correlations across Predictor Variable Categories

back to table of contents

7.1.5.1 Borrower assessment (credit rating variables) vs. borrower credit history variables and borrower indebtedness

Due to the prevailing non-normality among credit history variables, we are using the transformed variables instead of the original variables. Furthermore, some of these variables are highly positively correlated to each other, therefore, we use only some of them as examples.

1) current credit lines vs. interest rate over levels of credit rating variables

Current credit lines vs. interest rate over levels of credit rating variables: the group of borrowers with no current credits (i.e., without credit history) contains fewer or no individuals with high ratings of any kind. Other than that, the relationship between interest rate and credit rating seems to be the same or very similar for all borrowers with an existing credit history.

2) Open revolving monthly payment vs. interest rate over levels of credit rating variables

back to table of contents

Open revolving monthly payment vs. interest rate over levels of credit rating variables: The variable open revolving monthly payment is characterized by a trend which becomes obvious for revolving monthly payments of \$3,000 or more: this category of borrowers receives loans at a higher interest rate; furthermore, this category does not contain any individuals with the highest levels of professional credit rating. In conclusion, the revolving monthly payment becomes important in terms of interest rate once it reaches the \$3,000 threshold.

3) Delinquencies summed up over the Last 7 Years vs. interest rate over levels of credit rating variables

back to table of contents

Delinquencies summed up over the Last 7 Years vs. interest rate over levels of credit rating variables: Interestingly, there is a noticeable difference among the credit rating scores with regard to the last category of delinquencies (43 or more over the last 7 years): it contains no individuals with the highest levels of professional credit score. This trend is less noticeable in the 10-level Prosper Score, and in the 7-level rating only the highest category, AA, is absent. In conclusion, the credit rating with more levels is more sensitive towards the number of delinquencies.

4) Total Inquiries by creditors vs. interest rate over levels of credit rating variables

back to table of contents

Total Inquiries by creditors vs. interest rate over levels of credit rating variables: The 10-level Prosper Score and the professional credit rating seem to be more sensitive towards the variable 'total inquiries by creditors'. This is especially noticeable in the last category (20 or more inquiries).

5) Derogatory Public Records summed up over the Last 10 Years vs. interest rate over levels of credit rating variables

back to table of contents

Derogatory Public Records summed up over the Last 10 Years vs. interest rate over levels of credit rating variables: Again, the professional credit rating seems to be more sensitive towards a credit history variable, such as the sum of derogatory public records over the last 10 years: the group of borrowers with 2 or more such records does not contain any individuals with the highest levels of credit rating.

6) Amount Delinquent vs. interest rate over levels of credit rating variables

back to table of contents

Amount Delinquent vs. interest rate over levels of credit rating variables: A high amount of delinquency on loans in combination with a low credit rating drives the interest rate up. Again, the trend is more obvious with the professional credit score, where the highest credit rating levels are only present in borrowers with no or a very low amount delinquent.

7) Bankcard Utilization rate vs. interest rate over levels of credit rating variables

back to table of contents

Bankcard Utilization rate vs. interest rate over levels of credit rating variables: The bankcard utilization rate ("the percentage of available revolving credit that is utilized at the time the credit profile was pulled") seems to visibly drive the interest rates up, especially in combination with a lower credit rating. This trend is visible in all credit rating variables, but especially with the professional credit rating.

8) Borrower credit history in months vs. interest rate over levels of credit rating variables

back to table of contents

Borrower credit history in months vs. interest rate over levels of credit rating variables: Among the borrowers with the shortest and the longest credit history, there are no (or fewer) borrowers with the highest levels of professional credit score. The relationship between the three variables, credit rating (of whichever kind), credit history length, and interest rate, is complicated and possibly mediated by other variables.

9) Debt-To-Income Ratio vs. interest rate over levels of credit rating variables

back to table of contents

Debt-To-Income Ratio vs. interest rate over levels of credit rating variables: Debt-To-Income Ratio seems to visibly drive the interest rates up, especially in combination with a lower credit rating. This trend is visible in all credit rating variables, but especially with the professional credit rating.

Borrower assessment (credit rating variables) vs. borrower credit history variables and borrower indebtedness: Conclusion

Due to having more categories, the consumer credit rating by professional credit rating agencies is more sensitive towards variables of the borrower credit history category (credit history length, number of loans, number of revolving accounts, credit card utilization rate, delinquencies on loans, and derogatory public records) and the borrower indebtedness category (debt-to-income ratio).

7.1.5.2 Borrower assessment (credit rating variables) vs. loan characteristics (loan purpose and amount)

back to table of contents

1) Loan purpose vs. interest rate over levels of 7-level rating (Prosper Rating)

2) Loan purpose vs. interest rate over levels of 10-level rating (Prosper Score)

3) Loan purpose vs. interest rate over levels of Credit Score

Loan purpose vs. interest rate over levels of borrower credit rating variables: the relationship between credit scores and interest rate seems very consistent despite different loan purposes. There are, however, certain categories of loan purpose that stand out:

1) The interest rates for business loans, debt consolidation loans, home improvement loans, "auto" loans, and "other" loans seem to have the fewest outliers (judging by the short error bars);

2) Loans for a boat seem to be lent at more favorable conditions than loans for any other purpose; furthermore, among customers who borrowed money for a boat there were none with the lowest levels of the credit rating variables (HR/E; 1/2; 619-678).

4) Loan amount (ordinal) vs. interest rate over levels of 7-level rating (Prosper Rating)

back to table of contents

5) Loan amount (ordinal) vs. interest rate over levels of 10-level rating (Prosper Score)

6) Loan amount (ordinal) vs. interest rate over levels of Credit Score

Loan amount vs. interest rate over levels of borrower credit rating variables: the accessible loan amount is limited by the credit rating. Loans higher than approximately \$16,000 are not lent to low-rating borrowers. Furthermore, there are generally more outliers in interest rate at higher loan amounts, although the median interest rate seems to fall as the loan amount rises.

7.2. Relationship of Outcome Variable to Predictor Variables

back to table of contents

The outcome variable can be represented as the percentage of unrepaid loans within levels of the predictor variables. All predictor variables are transformed into ordinal variables. The outcome 'repaid_yn' is dichotomous.

A dataframe containing the proportions of unrepaid loans within the levels of the grouping variable and a graph displaying the same are printed by a custom function - repaid_props(initial_df, grouping_cat_var) - defined below.

We define the function in three modifications related to the form and size of the graph:

1) Graph with the outcome on the x-axis

2) Graph with the outcome on the y-axis

3) Extra-large graph with the outcome on the x-axis

The repaid_props function contains several steps:

1) A series s, obtained from a groupby object, containing the proportions of repaid and unrepaid loans within the levels of the grouping variable

2) The series s is transformed into a dataframe df by the reset_index()-method

3) We want to display only the percentage of unrepaid loans, therefore, the variable 'repaid_yn' is transformed into categorical

4) The 'Repaid'-category is removed

5) The resulting NaN-s are dropped

6) A new column 'percentage_nr' is computed by multiplying the values in 'percentage_not_repaid' by 100

7) The variables contained in the groupby-object are depicted in a graph
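The steps above can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the original function: the outcome label 'Not repaid', the optional `plot` switch, the rounding to two decimals, and the toy data are all assumptions, and only one graph layout (horizontal bars) is shown instead of the three modifications.

```python
import pandas as pd


def repaid_props(initial_df, grouping_cat_var, plot=False):
    """Percentage of unrepaid loans within each level of a grouping variable."""
    # 1) proportions of each outcome within the levels of the grouping variable
    s = (initial_df.groupby(grouping_cat_var)["repaid_yn"]
         .value_counts(normalize=True)
         .rename("percentage_not_repaid"))
    # 2) series -> dataframe
    df = s.reset_index()
    # 3) + 4) make the outcome categorical and remove the 'Repaid' category,
    #         which turns the corresponding values into NaN
    df["repaid_yn"] = (df["repaid_yn"].astype("category")
                       .cat.remove_categories(["Repaid"]))
    # 5) drop the resulting NaN rows
    df = df.dropna(subset=["repaid_yn"]).reset_index(drop=True)
    # 6) proportion -> percentage
    df["percentage_nr"] = (df["percentage_not_repaid"] * 100).round(2)
    # 7) optional graph (requires matplotlib)
    if plot:
        df.plot.barh(x=grouping_cat_var, y="percentage_nr", legend=False)
    return df


# Toy data (not Prosper values); 'Not repaid' is an assumed outcome label.
toy = pd.DataFrame({
    "Term": [12, 12, 12, 12, 36, 36],
    "repaid_yn": ["Repaid", "Repaid", "Repaid", "Not repaid",
                  "Repaid", "Not repaid"],
})
result = repaid_props(toy, "Term")
print(result)
```

Note that a level of the grouping variable without any unrepaid loans drops out of the result entirely, because no 'Not repaid' row exists for it after step 5.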

7.2.1 Borrower Assessment Variables

back to table of contents

1) Percentage of unrepaid loans by interest rate level

→ interest as ordinal variable: 'interest_16l'

The relationship between interest rate and loan outcome status seems roughly quadratic. The highest percentage of not repaid loans is in the 30-31.9% group. In the two highest interest rate groups the percentage of not repaid loans was lower. The most represented interest rate in the population is located in the second-highest group: 32-33.9%. A good question for multivariate exploration would be the relationship between interest rates, loan purpose and loan outcome status.

2) Percentage of unrepaid loans by level of 'ProsperRating (Alpha)'

back to table of contents

The relationship between the 7-level rating and loan outcome status seems almost ideally linear: the percentage of not repaid loans falls consistently from the lowest to the highest rating group.

3) Percentage of unrepaid loans by level of 'ProsperScore'

back to table of contents

The relationship between 10-level rating ('Prosper Score') and loan outcome status is roughly linear, although the trend seems less pronounced in the rating levels 2 (second to worst) to 5 (close to the middle of the scale). The levels from 5 to 10 (medium to best rating) seem to predict the loan outcome almost ideally. Furthermore, there is a pronounced difference between the worst score (1, with 44.32% not repaid loans) and the second to worst score (2, with 32.41% of not repaid loans).

4) Percentage of unrepaid loans by level of professional customer credit rating 'cscore_ordinal'

back to table of contents

The professional credit rating seems to predict loan outcome in an almost ideally linear manner. Borrowers with a rating of 889 or more had a zero default rate.

Correlation of 'status_bin' and credit rating variables (+ interest rate)

back to table of contents

https://www.uvm.edu/~statdhtx/StatPages/More_Stuff/OrdinalChisq/OrdinalChiSq.html

correlational approach vs chi-square

7.2.2 Loan Characteristics

back to table of contents

1) Percentage of unrepaid loans by 'loan_purpose'

'loan_purpose' contains an empty category: personal loans. First, it should be removed. For this purpose, we save a df with only 'repaid_yn' and 'loan_purpose'.
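Removing an empty category can be sketched as follows, assuming that 'loan_purpose' is stored with a pandas categorical dtype (an assumption; the category labels and values below are toy data).

```python
import pandas as pd

# Toy data: 'Personal Loan' is declared as a category but never occurs.
toy = pd.DataFrame({
    "repaid_yn": ["Repaid", "Not repaid", "Repaid"],
    "loan_purpose": pd.Categorical(
        ["Business", "Auto", "Business"],
        categories=["Auto", "Business", "Personal Loan"],
    ),
})

# Work on a copy with only the two columns of interest,
# then drop every category that has no observations.
purpose_df = toy[["repaid_yn", "loan_purpose"]].copy()
purpose_df["loan_purpose"] = purpose_df["loan_purpose"].cat.remove_unused_categories()

print(list(purpose_df["loan_purpose"].cat.categories))
```

Working on a copy keeps the original dataframe and its full category list untouched for the other analyses.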

The highest percentage of unrepaid loans was in the categories 'Green Loans' (38.89%), 'Household Expenses' (34.98%), 'Medical/Dental' (33.64%), and 'Baby & Adoption' (32.43%). Business loans are located right after this group of high-risk loans (27.41% not repaid). The lowest-risk loan purposes are as follows: 'Motorcycle' (3.53% not repaid), 'Engagement Ring' (5.46%), 'Recreational Vehicle' (6.25%), and 'Boat' (10.71%). It would be interesting to investigate the relationship between income, loan purpose and interest rate.

2) Percentage of unrepaid loans by levels of loan original amount

back to table of contents

The loan outcome seems unrelated to the loan size. A possible explanation would be that loan size interferes with other variables (loan purpose, interest rate, borrower rating, etc.) and the percentage of not repaid loans per loan size is explained by those variables.

3) Percentage of unrepaid loans by loan term

back to table of contents

The percentage of defaults is the lowest for the shortest loan term (12 months): 4.28%. It rises to 22.58% for the loan term of 36 months and is even higher for the longest term (60 months): 32.46%.

7.2.3 Borrower Characteristics

back to table of contents

1) Percentage of unrepaid loans by levels of 'EmploymentStatus'

The categories 'Not available', 'Self-employed', and 'Not employed' are removed next; 'Self-employed' and 'Not employed' have only one member each.

The categories of 'Employment Status' unfortunately are not well defined ('Employed' includes 'Full-time' and 'Part-time'; the categories 'Self-employed' and 'Not employed' have only 1 member each, which is unusual and means that either data entry or data collection was compromised). On the other hand, the category 'Other', which may include unemployed borrowers, has the highest percentage of not repaid loans: 44.77%. It is therefore questionable whether this variable can contribute to predicting or drawing inferences about the loan outcome status.

2) Percentage of unrepaid loans by levels of 'EmploymentStatusDuration'

back to table of contents

It appears that duration of employment has no influence over loan outcome, although two of the highest categories (480-504 months and 528-755 months) had the highest rates of not repaid loans: 42.11%. A clear trend is not visible and the relationship between the two variables could be influenced by other variables.

3) Percentage of unrepaid loans by levels of 'IncomeRange'

back to table of contents

Exclude 'Not employed', 'Not displayed', '$0'

There appears to be a clear linear relationship between the borrower annual income and the loan outcome: borrowers with higher annual income are less likely to default.

4) Percentage of unrepaid loans by home ownership (yes or no)

back to table of contents

There is a slightly lower percentage of not repaid loans in the group of borrowers who were home owners. It should be noted, however, that borrowers with a mortgage are also classified as homeowners, as the variable dictionary states ("A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.").

Correlation of 'status_bin' and loan original amount, employment status duration, income

back to table of contents

7.2.4 Borrower Credit History Variables, Borrower Indebtedness

back to table of contents

1) Percentage of unrepaid loans by credit history in months

A longer credit history seems to predict a higher possibility of default. It could be that for borrowers with shorter credit history other variables play a more important role.

2) Percentage of unrepaid loans by number of current credit lines

back to table of contents

The continuous count variable 'CurrentCreditLines' was transformed into the ordinal variable 'current_credits'.

Borrowers with no current credit lines have a very high percentage of not repaid loans: 60%. This could mean that first-time borrowers are especially risky. For the rest of the categories of 'Current credit lines' no trend is visible.

3) Percentage of unrepaid loans by number of open credit lines

back to table of contents

The continuous count variable 'OpenCreditLines' was transformed into the ordinal variable 'open_credits'.

The variable 'Open credit lines' seems like a copy of 'Current credit lines'.

4) Percentage of unrepaid loans by number of loans for the last 7 years

back to table of contents

The continuous count variable 'TotalCreditLinespast7years' was transformed into the ordinal variable 'total_credits_7y'.

It seems that a higher number of loans taken by a borrower over the last 7 years could predict a lower possibility of default. This is probably due to the fact that borrowers who have not defaulted in the past are still given loans, while borrowers who defaulted were subsequently rejected.

5) Percentage of unrepaid loans by number of Open Revolving Accounts

back to table of contents

The continuous count variable 'OpenRevolvingAccounts' was transformed into the ordinal variable 'open_accounts'.

Customers with no open revolving accounts had the highest rate of unrepaid loans: 47.7%. For the rest of the categories, the trend is not clear.

6) Percentage of unrepaid loans by amount of Open Revolving monthly payment

back to table of contents

The continuous variable 'OpenRevolvingMonthlyPayment' was transformed into the ordinal variable 'rev_mo_payment'.

Similarly to the number of open revolving accounts, zero payments were related to the highest default rate, whereas for the rest of the categories, no trend was visible.

7) Percentage of unrepaid loans by percentage of revolving utilization ('BankcardUtilization')

back to table of contents

The continuous variable 'BankcardUtilization' was transformed into the ordinal variable 'card_util_rate'.

Customers who had a credit card utilization rate of either 0% or 100% or more had a high percentage of not repaid loans (33.1% and 34.18%). For the rest of the borrowers, the probability of default rose slightly from 0.1-25% towards 75-100%.

Correlation of 'status_bin' and credit history (months), current credit lines, open credit lines, number of loans for last 7 years, open revolving accounts, open revolving monthly payments, percentage of revolving utilization, debt-to-income ratio

back to table of contents

8) Percentage of unrepaid loans by number of current delinquencies

back to table of contents

The continuous count variable 'CurrentDelinquencies' was transformed into 'delinquencies' based on the frequency distribution of the latter in the original dataset.
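A transformation of this kind can be sketched with `pd.cut`. The bin edges and labels below are assumptions chosen to match the groups mentioned in the text ("3 to 5", "6 or more"); the label "1-2" and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy counts (not Prosper data).
toy = pd.DataFrame({"CurrentDelinquencies": [0, 0, 1, 2, 3, 5, 6, 32]})

# Right-inclusive bins: (-inf, 0], (0, 2], (2, 5], (5, inf)
edges = [-np.inf, 0, 2, 5, np.inf]
labels = ["None", "1-2", "3-5", "6 or more"]

# pd.cut returns an ordered categorical, i.e. an ordinal variable.
toy["delinquencies"] = pd.cut(toy["CurrentDelinquencies"], bins=edges, labels=labels)
print(toy)
```

Unequal bin widths are deliberate for such highly skewed count data: equal-width bins would put almost all borrowers into the first category.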

The relationship between percentage of unrepaid loans and number of current delinquencies appears to be quadratic with a peak at the group who had 3 to 5 delinquencies (39.73% not repaid). The percentage in the group with 6 or more delinquencies was lower (35.15%).

9) Percentage of unrepaid loans by amount delinquent on loans

back to table of contents

The continuous variable 'AmountDelinquent' was transformed into the ordinal variable 'amount_delinquent'.

The relationship between the amount of current delinquencies and the default rate appeared to be quadratic, with a peak in the group of \$352 to \$3,040. However, the groups were not formed in equal intervals (due to skewness) and the real relationship may well be more complicated.

10) Percentage of unrepaid loans by number of delinquencies on loans summed up for the last 7 years

back to table of contents

The continuous count variable 'DelinquenciesLast7Years' was transformed into the ordinal variable 'delinquencies_7y'.

There seems to be no discernible trend in the relationship between default rate and sum of delinquencies for the last 7 years. The relationship may be influenced by interfering variables.

11) Percentage of unrepaid loans by number of inquiries by creditors for the last 6 months

back to table of contents

The continuous count variable 'InquiriesLast6Months' was transformed into the ordinal variable 'inquiries_6m'.

There is a linear trend between the default rate and the sum of creditor inquiries for the last 6 months. The percentage of unrepaid loans rises from 19.5% (no inquiries) to 39.34% (7 or more inquiries).

12) Percentage of unrepaid loans by total number of inquiries by creditors

back to table of contents

The continuous count variable 'TotalInquiries' was transformed into the ordinal variable 'inquiries_total'.

The group with 20 or more inquiries by creditors has the highest percentage of unrepaid loans: 35.29%. For customers with no inquiries the default rate is 19.84%. It increases slightly (by 1% or less) until the group with 10-12 inquiries and then drops slightly in the next group (13-19 inquiries). The difference between the last group and the group with 10-12 inquiries is substantial (35.29% vs. 25.55%). The relationship between the variables may be influenced by interfering variables.

13) Percentage of unrepaid loans by the number of Derogatory Public Records summed up for the last 10 years

back to table of contents

The continuous count variable 'PublicRecordsLast10Years' was transformed into the ordinal variable 'pub_rec_10y'.

There is a clear increase in default rate from borrowers with no derogatory public records during the last 10 years to those with 2 or more (21.84% to 31.51%).

14) Percentage of unrepaid loans by the number of Derogatory Public Records summed up for the last 12 months

back to table of contents

The continuous count variable 'PublicRecordsLast12Months' was transformed into the ordinal variable 'pub_rec_12m'.

Customers with 1 or more derogatory public records for the last 12 months had a substantially higher default rate than customers with no records: 36.8% vs. 22.81%.

15) Percentage of unrepaid loans by levels of debt to income ratio

back to table of contents

The continuous variable 'DebtToIncomeRatio' was transformed into the ordinal variable 'dti_ordinal'.

The relationship between DTI and default rate could be construed as quadratic or approximately linear: the default rate grows from 0-10% towards 200% or more. The growth rate is not even; however, the intervals of DTI are also not even due to high skewness.

Correlation of 'status_bin' and current delinquencies, amount delinquent, number of delinquencies for last 7 years, number of inquiries by creditors for last 6 months, total number of inquiries by creditors, number of Derogatory Public Records summed up for the last 10 years, number of Derogatory Public Records summed up for the last 12 months

back to table of contents

7.2.5 Chi-Square Test of Independence between Loan Outcome and Ordinal Predictor Variables

back to table of contents

chisq1_df: 'repaid_yn' + Borrower Assessment + Loan Characteristics + Borrower Characteristics

chisq2_df: 'repaid_yn' + Credit History + Open Revolving Accounts + Revolving Utilization

       + Delinquencies + Inquiries by Creditors + Derogatory Public Records + Borrower Indebtedness
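For a single predictor, a chi-square test of independence can be sketched as below. This is the plain Pearson test on a contingency table, which ignores the ordering of the predictor levels; whether the notebook used this variant or an ordinal one is not stated, so it should be read as an illustration only. The counts are toy data and 'Not repaid' is an assumed outcome label.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (not Prosper values): 40 loans with a 12-month term, 40 with 60 months.
toy = pd.DataFrame({
    "repaid_yn": ["Repaid"] * 30 + ["Not repaid"] * 10
                 + ["Repaid"] * 20 + ["Not repaid"] * 20,
    "Term": [12] * 40 + [60] * 40,
})

# Contingency table: predictor levels in rows, loan outcome in columns.
observed = pd.crosstab(toy["Term"], toy["repaid_yn"])

# correction=False switches off the Yates continuity correction for 2x2 tables.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
```

To cover all predictors in a dataframe such as chisq1_df, the same call could be repeated in a loop over the predictor columns, collecting the statistic, degrees of freedom, and p-value per variable.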

7.3. Relationship of Loan Outcome Status to Predictor Variables Grouped by Loan Purpose

back to table of contents

In this section, we present the group differences between repaid and unrepaid loans on interval-level variables, split by loan purpose. We do the same for the ordinal variables, using a version of the function from part 7.2 whose groupby object is defined with two grouping variables.

The loan purpose variable has the following issues:

1) The "personal loan" category is empty and

2) the "not available" category does not contain unrepaid loans.

3) The "recreational vehicle" category contains a very low number of unrepaid loans.

Define function for boxplots for most predictor variables:

Define function for boxplots for count predictor variables (with extreme outliers):

Define function for bar charts depicting percentage of unrepaid loans in combinations of categorical/ordinal predictor variables:
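The table behind such a bar chart can be sketched as follows. The function name `unrepaid_pct_by_two`, the outcome label 'Not repaid', and the toy data are assumptions; the plotting step is left out, and the percentage is computed as the mean of a boolean indicator rather than via the category-removal steps of part 7.2.

```python
import pandas as pd


def unrepaid_pct_by_two(df, first_var, second_var,
                        outcome="repaid_yn", positive="Not repaid"):
    """Percentage of unrepaid loans for each combination of two grouping variables."""
    is_unrepaid = df[outcome].eq(positive)  # True where the loan was not repaid
    pct = is_unrepaid.groupby([df[first_var], df[second_var]]).mean() * 100
    return pct.round(2).rename("percentage_nr").reset_index()


# Toy data (not Prosper values).
toy = pd.DataFrame({
    "loan_purpose": ["Business", "Business", "Business", "Business", "Auto", "Auto"],
    "Term": [36, 36, 60, 60, 36, 36],
    "repaid_yn": ["Repaid", "Not repaid", "Not repaid", "Not repaid",
                  "Repaid", "Repaid"],
})
combo = unrepaid_pct_by_two(toy, "loan_purpose", "Term")
print(combo)
```

Unlike the category-removal approach, this variant keeps combinations with 0% unrepaid loans in the table, which makes empty bars visible in a chart.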

7.3.1 Borrower Assessment Variables

back to table of contents

1) Interest rate x loan purpose x loan outcome

Regarding the effect of interest rate (split by loan purpose) on the loan status outcome, there is a clear trend of unrepaid loans having been approved at a higher interest rate. The exceptions are loans for cosmetic procedures and "green" loans.

2) Prosper Rating (7-level rating) x loan purpose x loan outcome

back to table of contents

Regarding the effect of 7-level borrower rating (also named "Prosper Rating Alpha", split by loan purpose) on the loan outcome, there is a somewhat clear trend of unrepaid loans belonging to borrowers with lower rating. The exceptions are loans for cosmetic procedures and "green" loans.

3) Prosper Score (10-level rating) x loan purpose x loan outcome

back to table of contents

Regarding the effect of 10-level borrower rating (also named "Prosper Score", split by loan purpose) on the loan outcome, there is a clear trend of unrepaid loans belonging to borrowers with lower rating. The exceptions are loans for cosmetic procedures and weddings, where both repaid and unrepaid loans were lent to borrowers with overall equal ratings, and "green" loans, where the borrowers who repaid their loans had a (groupwise) lower rating than those who did not repay them.

4) CreditScoreRangeUpper (professional credit rating) x loan purpose x loan outcome

back to table of contents

Regarding the effect of professional credit score (split by loan purpose) on the loan outcome, there is a somewhat clear trend of unrepaid loans belonging to borrowers with lower rating. The exceptions are student use loans, engagement ring loans, and medical and dental loans.

7.3.2 Loan Characteristics

back to table of contents

1) LoanOriginalAmount x loan purpose x loan outcome

The loan amount variable is characterized by positive outliers in almost every category of loan purpose. There is no clear trend regarding the medians (or the bulk of the scores) of repaid vs. unrepaid loans.

2) Loan term (ordinal) x loan purpose x loan outcome

back to table of contents

In about half of the loan purpose categories, short-term loans (12 months) contain no unrepaid loans. The loan outcome clearly depends on the loan term: the longer the term, the higher the probability of default.

7.3.3 Borrower Characteristics

back to table of contents

1) StatedMonthlyIncome x loan purpose x loan outcome

The loan purpose category 'Debt Consolidation' (repaid loans) contains some extreme high outliers in Stated Monthly Income. In order to make the graph more comprehensible, we can filter them out.

There are two extreme cases: 17411 and 8066. We can exclude them and redo the graph.
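Excluding the highest values before redrawing the graph can be sketched as below. The sketch selects the two largest values programmatically with `nlargest` instead of hard-coding the two cases named above, and all values and row labels are toy data.

```python
import pandas as pd

# Toy data (not Prosper values) with two extreme incomes.
toy = pd.DataFrame(
    {"StatedMonthlyIncome": [3000.0, 4500.0, 5200.0, 250000.0, 6100.0, 480000.0]},
    index=[10, 11, 12, 13, 14, 15],
)

# Row labels of the two highest stated monthly incomes.
extreme_idx = toy["StatedMonthlyIncome"].nlargest(2).index

# Drop those rows; the original dataframe is left unchanged.
trimmed = toy.drop(index=extreme_idx)
print(trimmed)
```

The trimmed dataframe is meant for plotting only; the excluded cases remain in the original data for all other analyses.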

Redo graph: StatedMonthlyIncome (minus 2 highest outliers) x loan purpose x loan outcome

Generally, the median stated monthly income of borrowers with unrepaid loans is lower in almost all categories of loan purpose. The exceptions are loans for "Student Use", "Boat", and "Recreational Vehicle". The variable "Stated Monthly Income" has many high outliers. The three highest outliers are situated in the loan purpose category "Debt Consolidation". Other loan purpose categories with high outliers in stated monthly income are business loans, wedding loans, "home improvement" loans, and the category "Other".

2) EmploymentStatusDuration x loan purpose x loan outcome


The Employment Status Duration variable is characterized by positive outliers in almost every category of loan purpose. There is no clear trend regarding the medians (or the bulk of the scores) of repaid vs. unrepaid loans.

3) Housing


Generally, borrowers who default tend not to be homeowners. The opposite is true for two loan purpose categories: 'Baby and Adoption' and 'Vacation'.

7.3.4 Borrower Credit History Variables, Borrower Indebtedness

back to table of contents

1) CurrentCreditLines x loan purpose x loan outcome

'current_credits' ordinal

2) OpenCreditLines x loan purpose x loan outcome

'open_credits' ordinal

3) TotalCreditLinespast7years x loan purpose x loan outcome

'total_credits_7y' ordinal:

4) OpenRevolvingAccounts x loan purpose x loan outcome

'open_accounts' ordinal:

5) OpenRevolvingMonthlyPayment x loan purpose x loan outcome

'rev_mo_payment' ordinal:

6) CurrentDelinquencies x loan purpose x loan outcome

current delinquencies as ordinal var:


The number of current delinquencies seems to be a strong predictor of unrepaid loans in the loan purpose categories "business loans", "debt consolidation loans", "wedding loans", "home improvement loans", and "auto loans". Therefore, the loan purpose category interacts with the overall effect of the predictor "number of current delinquencies".

7) DelinquenciesLast7Years x loan purpose x loan outcome

delinquencies over 7 years as ordinal var:

There is no obvious linear trend in the effect of the sum of delinquencies for the past 7 years on the percentage of unrepaid loans split by loan purpose categories.

8) AmountDelinquent x loan purpose x loan outcome

AmountDelinquent as ordinal var:

There is no obvious linear trend in the effect of "amount delinquent" on the percentage of unrepaid loans split by loan purpose categories.

9) InquiriesLast6Months x loan purpose x loan outcome

InquiriesLast6Months as ordinal var:

The sum of inquiries by creditors for the past 6 months seems to be a good predictor of the percentage of unrepaid loans in the following loan purpose categories: debt consolidation, home improvement, auto, other. There is therefore an interaction between the sum of inquiries (6 months) and loan purpose as predictors of loan outcome.

10) TotalInquiries x loan purpose x loan outcome

TotalInquiries as ordinal var:

There is no obvious linear trend in the effect of the sum of inquiries by creditors (total) on the percentage of unrepaid loans split by loan purpose categories.

11) PublicRecordsLast12Months x loan purpose x loan outcome

PublicRecordsLast12Months as ordinal var:

It seems that the sum of derogatory public records for the last 12 months would only be a useful predictor in certain subsets of the data, such as the loan purpose categories of business and debt consolidation.

12) PublicRecordsLast10Years x loan purpose x loan outcome

PublicRecordsLast10Years as ordinal var:

The effect of the predictor "sum of derogatory public records for the past 10 years" seems to interact with the loan purpose category: the linear effect is more or less pronounced depending on the loan purpose.

13) BankcardUtilization x loan purpose x loan outcome

'card_util_rate' ordinal:

14) credit_history_months x loan purpose x loan outcome


The credit history length variable contains positive outliers in most categories of loan purpose. There is no clear trend regarding the outcome variable, repaid vs. unrepaid loans.

15) DebtToIncomeRatio x loan purpose x loan outcome

dti ordinal:

When the cases are grouped by loan purpose and level of debt-to-income ratio (DTI), several groups have an extremely high percentage of unrepaid loans: 1) loans for large purchases taken on by borrowers with DTI of 50-100% and 200% or more; 2) business loans given to borrowers with 100-200% DTI; 3) loans for household expenses given to borrowers with 100-200% DTI; 4) loans for home improvement lent to borrowers with 200% or more DTI. There is no consistent pattern of increase in the percentage of unrepaid loans as DTI increases. Rather, each loan purpose category has its own specific pattern. In other words, there is an interaction between the predictors loan purpose and DTI regarding their effect on the percentage of unrepaid loans.
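The grouping behind this kind of comparison can be sketched as a two-way table of default percentages. The example below uses toy data, and the column names `loan_purpose`, `dti_level`, and `unrepaid` (coded 1 for unrepaid, 0 for repaid) are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical): one row per loan, unrepaid coded as 1.
loans = pd.DataFrame({
    "loan_purpose": ["Business", "Business", "Business", "Auto", "Auto", "Auto"],
    "dti_level": ["<50%", "100-200%", "100-200%", "<50%", "<50%", "100-200%"],
    "unrepaid": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column is the share of unrepaid loans; scale to percent
# and spread the DTI levels across columns (one row per loan purpose).
pct_unrepaid = (
    loans.groupby(["loan_purpose", "dti_level"])["unrepaid"]
    .mean()
    .mul(100)
    .unstack("dti_level")
)
```

If the rows of such a table follow different profiles across the DTI levels, that is the kind of pattern described above as an interaction.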

8. Conclusions

back to table of contents

Generally, a statistical model can be defined as a prediction model, an inference model, or a combination of the two. A prediction model is focused on how to best predict an outcome based on a combination of predictors without looking too much into the relationships among the predictors. An inference model, on the other hand, is also focused on the way predictors influence each other as well as on the influence of each separate predictor variable on the outcome (James, G. et al., 2021) [5].

Including all of the predictor variables in a regression model of the Prosper data proved to be difficult.

The first issue we encountered was the variety of non-normal distributions of the continuous variables. Furthermore, many of these variables were count data containing a lot of zeros.

The zeros, together with multimodal and strongly skewed distributions (which visibly did not resemble a log-normal distribution), precluded log-transformation. A possible solution would be to use a regression model designed for count data, which are often encountered in the life sciences [4].

However, the aim of an exploratory study like this one is to gain insight into the relationships among variables. We have built neither a prediction nor an inference model here. Instead, continuous (and discrete) variables were transformed into ordinal variables, and the percentage of unrepaid loans was computed for each level of each variable. Plotting the percentages proved to be very illustrative and allowed us to classify the variables as possible strong or weak predictors of loan outcome status.

The transformation was performed using either percentiles or equal intervals, depending on the distribution of the raw data.
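The two binning strategies can be illustrated with a short sketch. It uses toy data and hypothetical column names (`amount`, `unrepaid`) and is not the notebook's own transformation code; the number of bins and the bin edges are arbitrary choices for the example.

```python
import pandas as pd

# Toy data (hypothetical): loan amount and outcome, unrepaid coded as 1.
loans = pd.DataFrame({
    "amount": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000],
    "unrepaid": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Percentile-based levels: each bin holds roughly the same number of loans.
loans["amount_q"] = pd.qcut(loans["amount"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Equal-interval levels: each bin spans the same range of values.
loans["amount_eq"] = pd.cut(
    loans["amount"], bins=[0, 4000, 8000], labels=["0-4k", "4k-8k"]
)

# Percentage of unrepaid loans per ordinal level.
pct_by_quartile = loans.groupby("amount_q", observed=True)["unrepaid"].mean() * 100
pct_by_interval = loans.groupby("amount_eq", observed=True)["unrepaid"].mean() * 100
```

With heavily tied data, such as counts with many zeros, `pd.qcut` can fail because several percentile edges coincide; its `duplicates="drop"` option or manually chosen bins are possible workarounds.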

In addition, there are unordered categorical variables, such as loan purpose and homeownership, which influence the ordered ones.

Correlation measures such as Pearson or Spearman are not suitable for assessing the relationship between loan outcome and the predictors because they assume linearity (Spearman uses ranked scores but applies the same assumption to the ranks). Chi-square tests of independence rejected the hypothesis of independence between the outcome and most of the predictor variables: all tests were statistically significant except two (loan outcome vs. credit history length and loan outcome vs. employment status duration). However, to interpret these results properly, the p-values should be corrected for multiple comparisons. Given the exploratory aim of this study, we did not apply such a correction.
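For readers who want to apply a correction, one common option is the Holm step-down procedure. The sketch below is an illustration only: the choice of Holm is an assumption (the study does not name a method), `holm_adjust` is a hypothetical helper, and the p-values are made up rather than taken from the chi-square tests.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order (hypothetical helper)."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up p-values standing in for four separate tests.
raw = [0.001, 0.04, 0.03, 0.20]
adj = holm_adjust(raw)
```

A test would then count as significant at level alpha only if its adjusted p-value is below alpha.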

The following classification into 'weak' and 'strong' predictors is based only on inspection of plots.

Strong predictors:

Weak predictors:

Unordered categorical variables:

Strong predictors of default would be redundant in a prediction model, especially the different credit rating variables (all of which are strongly correlated with the interest rate, as theoretical models propose). Deciding which of them to include in a model is beyond the scope of this discussion. The 'weak' predictors, on the other hand, could have an important influence as part of a regression model. A necessary preparatory step for such a model would be to decide on a suitable transformation method. Unordered categorical variables, such as loan purpose, should also be included in a prediction model. Plotting the combinations of loan purpose with other variables showed that certain categories of loan purpose are clearly distinct in terms of default rate, interest rate, loan size, income, and so forth.

[4](#countdata) How to Deal with Count Data

[5](#statlearning) James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning. New York: Springer.